confidence measure
Calibrated Structured Prediction
Volodymyr Kuleshov, Percy S. Liang
In user-facing applications, displaying calibrated confidence measures (probabilities that correspond to true frequencies) can be as important as obtaining high accuracy. We are interested in calibration for structured prediction problems such as speech recognition, optical character recognition, and medical diagnosis. Structured prediction presents new challenges for calibration: the output space is large, and users may issue many types of probability queries (e.g., marginals) on the structured output. We extend the notion of calibration so as to handle various subtleties pertaining to the structured setting, and then provide a simple recalibration method that trains a binary classifier to predict probabilities of interest. We explore a range of features appropriate for structured recalibration, and demonstrate their efficacy on three real-world datasets.
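The recalibration idea above (learning a mapping from a model's raw probabilities to calibrated ones) can be illustrated with histogram binning, used here as a simple stand-in for the paper's binary-classifier method; the data below is invented for illustration.

```python
from collections import defaultdict

def fit_histogram_binning(raw_probs, correct, n_bins=10):
    """Learn a mapping from raw probability to empirical frequency.

    raw_probs: model-reported probabilities for some event of interest
               (e.g. a marginal being correct); correct: 0/1 outcomes.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for p, y in zip(raw_probs, correct):
        b = min(int(p * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    # calibrated value per bin = empirical frequency of the event
    return {b: sums[b] / counts[b] for b in counts}

def recalibrate(p, mapping, n_bins=10):
    b = min(int(p * n_bins), n_bins - 1)
    return mapping.get(b, p)  # fall back to the raw prob for empty bins

# toy data: an overconfident model (reports ~0.9, right only 60% of the time)
raw = [0.92, 0.95, 0.91, 0.93, 0.94]
out = [1, 1, 1, 0, 0]
m = fit_histogram_binning(raw, out)
print(recalibrate(0.94, m))  # 0.6, the empirical frequency in that bin
```

The paper's recalibrator generalizes this by replacing the per-bin lookup with a trained binary classifier over richer features of the structured output.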
Beyond the Hook: Predicting Billboard Hot 100 Chart Inclusion with Machine Learning from Streaming, Audio Signals, and Perceptual Features
The advent of digital streaming platforms has recently revolutionized the landscape of the music industry, with the ensuing digitalization providing structured data collections that open new research avenues for investigating popularity dynamics and mainstream success. The present work explored which determinants hold the strongest predictive influence on a track's inclusion in the Billboard Hot 100 charts, including streaming popularity, measurable audio signal attributes, and probabilistic indicators of human listening. The analysis revealed that popularity was by far the most decisive predictor of Billboard Hot 100 inclusion, with considerable contributions from instrumentalness, valence, duration, and speechiness. Logistic Regression achieved 90.0% accuracy, with very high recall for charting singles (0.986) but lower recall for non-charting ones (0.813), yielding balanced F1-scores around 0.90. Random Forest slightly improved performance to 90.4% accuracy, maintaining near-perfect precision for non-charting singles (0.990) and high recall for charting ones (0.992), with F1-scores up to 0.91. Gradient Boosting (XGBoost) reached 90.3% accuracy, delivering a more balanced trade-off by improving recall for non-charting singles (0.837) while sustaining high recall for charting ones (0.969), resulting in F1-scores comparable to the other models.
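The F1-scores quoted above follow from precision and recall as their harmonic mean, which is why near-perfect performance on one side of the trade-off still yields F1 around 0.89-0.91 when the other side sits in the low 0.80s. A quick check with illustrative values (not figures from the paper):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# hypothetical values: near-perfect precision, recall in the low 0.80s
print(round(f1(0.99, 0.81), 3))  # 0.891 -- the weaker side dominates
print(round(f1(0.90, 0.90), 3))  # 0.9   -- balanced inputs pass through
```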
Scalable Best-of-N Selection for Large Language Models via Self-Certainty
Kang, Zhewei, Zhao, Xuandong, Song, Dawn
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models (LLMs) through increased test-time computation. Current state-of-the-art methods often employ computationally intensive reward models for response evaluation and selection. Reward-free alternatives, like self-consistency and universal self-consistency, are limited in their ability to handle open-ended generation tasks or scale effectively. To address these limitations, we propose self-certainty, a novel and efficient metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without requiring external reward models. We hypothesize that higher distributional self-certainty, aggregated across multiple samples, correlates with improved response accuracy, as it reflects greater confidence in the generated output. Through extensive experiments on various reasoning tasks, we demonstrate that self-certainty (1) scales effectively with increasing sample size $N$, akin to reward models but without the computational overhead; (2) complements chain-of-thought, improving reasoning performance beyond greedy decoding; and (3) generalizes to open-ended tasks where traditional self-consistency methods fall short. Our findings establish self-certainty as a practical and efficient way to improve LLM reasoning capabilities. The code is available at https://github.com/backprop07/Self-Certainty
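One plausible formalization of a reward-free confidence score of this kind (a sketch, not necessarily the paper's exact definition) scores a response by the average peakedness of its per-token output distributions, then picks the most confident of the N candidates:

```python
import math

def self_certainty(token_dists):
    """Average negative entropy of the per-token distributions.

    token_dists: one probability distribution per generated token;
    peaked distributions => higher score => more confident output.
    """
    def neg_entropy(p):
        return sum(q * math.log(q) for q in p if q > 0)
    return sum(neg_entropy(p) for p in token_dists) / len(token_dists)

def best_of_n(candidates):
    """candidates: (response_text, token_dists) pairs; keep the max score."""
    return max(candidates, key=lambda c: self_certainty(c[1]))[0]

peaked = [[0.97, 0.01, 0.01, 0.01]] * 3   # confident generation
flat   = [[0.25, 0.25, 0.25, 0.25]] * 3   # uncertain generation
print(best_of_n([("answer A", flat), ("answer B", peaked)]))  # answer B
```

Unlike a reward model, this only reads probabilities the LLM already produces, which is why it adds no extra model calls as N grows.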
MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels
Liu, Xiaoou, Lin, Zhen, Da, Longchao, Chen, Chacha, Trivedi, Shubhendu, Wei, Hua
Large Language Models (LLMs) require robust confidence estimation, particularly in critical domains like healthcare and law where unreliable outputs can lead to significant consequences. Despite much recent work in confidence estimation, current evaluation frameworks rely on correctness functions -- various heuristics that are often noisy and expensive, and may introduce systematic biases. These methodological weaknesses tend to distort evaluation metrics and thus the comparative ranking of confidence measures. We introduce MCQA-Eval, an evaluation framework for assessing confidence measures in Natural Language Generation (NLG) that eliminates dependence on an explicit correctness function by leveraging gold-standard correctness labels from multiple-choice datasets. MCQA-Eval enables systematic comparison of both internal state-based white-box (e.g. logit-based) and consistency-based black-box confidence measures, providing a unified evaluation methodology across different approaches. Through extensive experiments on multiple LLMs and widely used QA datasets, we report that MCQA-Eval provides efficient and more reliable assessments of confidence estimation methods than existing approaches.
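With gold 0/1 correctness labels in hand, evaluating a confidence measure reduces to a ranking check such as AUROC; a minimal sketch with invented scores:

```python
def auroc(confidences, correct):
    """Probability that a randomly chosen correct answer receives higher
    confidence than a randomly chosen incorrect one (ties count 0.5)."""
    pos = [c for c, y in zip(confidences, correct) if y == 1]
    neg = [c for c, y in zip(confidences, correct) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# gold labels (e.g. from multiple-choice answers) vs. confidence scores
conf = [0.9, 0.8, 0.4, 0.3, 0.7]
gold = [1,   1,   0,   0,   1]
print(auroc(conf, gold))  # 1.0: every correct answer outranks every error
```

The point of using multiple-choice gold labels is that `gold` here is exact, rather than the output of a noisy correctness heuristic.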
Reviews: A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks
The study addresses detecting abnormal inputs to deep neural networks: out-of-distribution inputs, adversarial inputs, and new classes (for class-incremental learning). To achieve this, the authors fit class-conditional Gaussian distributions with a tied covariance (as in linear discriminant analysis) to the features at various stages of a target neural network, constructing distributions over the valid inputs (inliers). They use the Mahalanobis distance under these Gaussians as a confidence measure (proportional to the log-likelihood). They further enhance the confidence measure by taking Fast Gradient Sign Method-style steps in the input space to increase the score. Finally, they combine the scores gathered at different layers of the network through a linear combination.
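The core scoring step at a single layer can be sketched as follows, with tiny 2-D feature vectors standing in for penultimate-layer activations and a diagonal tied covariance for simplicity (the paper uses a full tied covariance):

```python
def fit_class_gaussians(features, labels):
    """Per-class means plus a single (tied) diagonal covariance,
    estimated from feature vectors of the training data."""
    classes = sorted(set(labels))
    dim = len(features[0])
    means = {}
    for c in classes:
        xs = [x for x, y in zip(features, labels) if y == c]
        means[c] = [sum(col) / len(xs) for col in zip(*xs)]
    # tied covariance: pooled over classes, diagonal in this sketch
    var = [0.0] * dim
    for x, y in zip(features, labels):
        for d in range(dim):
            var[d] += (x[d] - means[y][d]) ** 2
    var = [v / len(features) for v in var]
    return means, var

def confidence(x, means, var):
    """Negative min squared Mahalanobis distance to any class mean;
    higher means more in-distribution (proportional to log-likelihood)."""
    def m2(x, mu):
        return sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, var))
    return -min(m2(x, mu) for mu in means.values())

feats = [[0.0, 0.1], [0.2, -0.1], [2.0, 2.1], [1.8, 1.9]]
labels = [0, 0, 1, 1]
means, var = fit_class_gaussians(feats, labels)
# a point near a class mean scores higher than a far-away outlier
print(confidence([0.1, 0.0], means, var) > confidence([9.0, 9.0], means, var))
```

The method then repeats this at several layers and learns a linear combination of the per-layer scores.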
Towards interfacing large language models with ASR systems using confidence measures and prompting
Naderi, Maryam, Hermann, Enno, Nanchen, Alexandre, Hovsepyan, Sevada, Magimai-Doss, Mathew
As large language models (LLMs) grow in parameter size and capabilities, such as interaction through prompting, they open up new ways of interfacing with automatic speech recognition (ASR) systems beyond rescoring n-best lists. This work investigates post-hoc correction of ASR transcripts with LLMs. To avoid introducing errors into likely accurate transcripts, we propose a range of confidence-based filtering methods. Our results indicate that this can improve the performance of less competitive ASR systems.
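The confidence-based filtering described above amounts to routing only low-confidence hypotheses to the LLM; a minimal sketch, where the threshold and the correction callable are illustrative assumptions:

```python
def correct_transcripts(transcripts, confidences, llm_correct, threshold=0.8):
    """Send only low-confidence ASR hypotheses to the LLM for correction,
    leaving likely-accurate transcripts untouched to avoid new errors."""
    out = []
    for text, conf in zip(transcripts, confidences):
        out.append(llm_correct(text) if conf < threshold else text)
    return out

# hypothetical stand-in for an LLM prompted to fix ASR errors
fake_llm = lambda t: t.replace("wreck a nice beach", "recognize speech")

hyps = ["how to wreck a nice beach", "the weather is sunny today"]
confs = [0.55, 0.97]
print(correct_transcripts(hyps, confs, fake_llm))
```

Only the first hypothesis (confidence 0.55) is rewritten; the high-confidence one passes through unchanged, which is the mechanism the paper uses to protect already-accurate transcripts.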
Contextualized Sequence Likelihood: Enhanced Confidence Scores for Natural Language Generation
Lin, Zhen, Trivedi, Shubhendu, Sun, Jimeng
The advent of large language models (LLMs) has dramatically advanced the state-of-the-art in numerous natural language generation tasks. For LLMs to be applied reliably, it is essential to have an accurate measure of their confidence. Currently, the most commonly used confidence score function is the likelihood of the generated sequence, which, however, conflates semantic and syntactic components. For instance, in question-answering (QA) tasks, an awkward phrasing of the correct answer might result in a lower probability prediction. Additionally, different tokens should be weighted differently depending on the context. In this work, we propose enhancing the predicted sequence probability by assigning different weights to various tokens using attention values elicited from the base LLM. By employing a validation set, we can identify the relevant attention heads, thereby significantly improving the reliability of the vanilla sequence probability confidence measure. We refer to this new score as the Contextualized Sequence Likelihood (CSL). CSL is easy to implement, fast to compute, and offers considerable potential for further improvement with task-specific prompts. Across several QA datasets and a diverse array of LLMs, CSL has demonstrated significantly higher reliability than state-of-the-art baselines in predicting generation quality, as measured by the AUROC or AUARC.
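The reweighting idea can be sketched in a few lines, with made-up attention weights standing in for the ones CSL elicits from the base LLM's attention heads:

```python
def sequence_logprob(token_logprobs):
    """Vanilla confidence: mean token log-probability."""
    return sum(token_logprobs) / len(token_logprobs)

def weighted_logprob(token_logprobs, attention):
    """CSL-style confidence: attention-weighted mean, so contextually
    important tokens dominate and filler tokens matter less."""
    z = sum(attention)
    return sum(w / z * lp for w, lp in zip(attention, token_logprobs))

# "the answer is , um , Paris": fillers get low probability but also
# low (hypothetical) attention; the key token carries most of the weight
logps = [-0.1, -0.2, -0.1, -2.0, -3.0, -2.0, -0.05]
attn  = [0.05, 0.05, 0.05, 0.01, 0.01, 0.01, 0.82]
print(sequence_logprob(logps))        # dragged down by the awkward fillers
print(weighted_logprob(logps, attn))  # dominated by the confident key token
```

The awkwardly phrased but correct answer scores much higher under the weighted measure, which is exactly the QA failure mode the abstract describes for vanilla sequence likelihood.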